The purpose of this course is to teach you the basics of the R language and give you the confidence to tackle larger projects using the language. Importantly, we want to get you thinking like a programmer. This doesn’t mean that by the end of the course you will know R fully, but you will know enough so you can go online and look for the help you need to complete most tasks.
Programming is like any skill, the more you practice the better you get. It’s really important that you keep using what you have learned after the course is completed otherwise there is a good chance you will forget everything and you’ll be back to square one.
R is a high-level programming language with a focus on mathematics and statistics, but it can be used for a wide variety of applications given its flexibility. R is also free, and available for all operating systems. Given the richness of the language and the fact that it costs nothing to use, bioinformaticians have used R for more than 20 years as the platform on which to develop packages to solve bioinformatics problems.
The Bioconductor Project is a repository for bioinformatics tools which continues to grow, and hosts packages such as DESeq2 which you may have heard of. Some other popular packages such as Seurat aren’t actually hosted by Bioconductor, but in the main R package repository (CRAN). We’ll cover package installation later.
The Python language has been rocketing in popularity for the past few years, particularly among data scientists who make use of AI/ML tools such as TensorFlow and PyTorch. Scanpy is a very popular Python package for single-cell analysis. For very computationally intensive tasks (e.g. sequence alignment), languages such as C/C++/Rust are more commonly used, but these are far more difficult to learn.
We’re going to take a different approach in this course. You will be taught the basics of the R language while doing small exercises along the way. However, we will finish with you undertaking a project which will push you quite hard. The aim is that tackling a more difficult problem will consolidate what you have learned, and that you will learn more by having to look up solutions to the problems you will likely face.
Point your browser to http://cran.r-project.org/ to download and install the latest version of R. For these tutorials we are also going to use RStudio, an advanced development environment for R which includes an editor, a console, and a plotting window. You will see what this means later.
Open up RStudio, and it will look something like this:
The different parts are:
Before we start, we need to do a little prep.
We then set the working directory to this folder:
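A sketch of the command (the folder path here is hypothetical — use wherever you created your course folder):

```r
setwd("~/r_course")   # hypothetical path to your course folder
```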
RStudio will now be looking for files in this folder, and any saved plots will be put here unless stated otherwise.
Now, go to File > New File > R Script
A new empty script will open up in the top-left window. Go to File > Save and give it a name. It will be saved to your current working directory. You should see your file being added to the list in the Plots and Files pane.
Now that we’ve done our prep, let’s do some R.
We’ll now look at some basic operations. The code should be copied into your R script as we go along.
Into your script copy/type the following line:
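The line, written out the long way with the c (combine) function:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
```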
This will make a vector of values from 1 to 10, and
put them into a variable called x.
Execute the code by hitting the “Run” button at the top-right of the script window. You will see this line appear in the R console below.
To view the contents of the object you have just created, just type
x in the console and hit return:
## [1] 1 2 3 4 5 6 7 8 9 10
The contents of x are now printed out.
Now is a good time to learn about commenting and documenting code. This is free text you put into your scripts that tells the reader what’s going on. It also helps remind your future self of what you did or your thought process. Comments are put in using #, so for example:
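Something like:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   # make a vector of 1 to 10 and store it in x
```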
Anything after a # will be ignored by R. You can run the
code again to check.
Rather than typing in the values 1 to 10, there is a much simpler way to create the same vector using the colon (:) operator:
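The colon version:

```r
x <- 1:10
x
```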
## [1] 1 2 3 4 5 6 7 8 9 10
Much better! Using a colon will always do increments of 1, and it’s also bidirectional:
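For example, counting down and crossing zero:

```r
5:-5
```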
## [1] 5 4 3 2 1 0 -1 -2 -3 -4 -5
Another way of creating a sequence of numbers is to use the
seq function. To learn how this function works, issue the
command help(seq). In R you can get a manual for any
function using the help() command (or
e.g. ?seq). To generate a vector of numbers from 0 to 100 in steps of 10 we need:
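The seq call:

```r
seq(from = 0, to = 100, by = 10)
```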
## [1] 0 10 20 30 40 50 60 70 80 90 100
Exercise: Generate a vector called ‘b’
ranging from 3 to 987 where the vector is 53 entries long.
Done? Check the length of the vector you have just made by issuing
length(b).
A note about assigning things to variable names: I use <-, but you can also use =. So
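For example (the variable name a is assumed here):

```r
a <- 1:5
a
```

and `a = 1:5` produces exactly the same result.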
## [1] 1 2 3 4 5
is the same as:
## [1] 1 2 3 4 5
We can also make a new vector d using a vector
c:
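Judging from the output below, d is the reciprocal of a vector c of 1 to 50:

```r
c <- 1:50   # using c as a variable name works, though it shadows the c() function
d <- 1/c
d
```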
## [1] 1.00000000 0.50000000 0.33333333 0.25000000 0.20000000 0.16666667
## [7] 0.14285714 0.12500000 0.11111111 0.10000000 0.09090909 0.08333333
## [13] 0.07692308 0.07142857 0.06666667 0.06250000 0.05882353 0.05555556
## [19] 0.05263158 0.05000000 0.04761905 0.04545455 0.04347826 0.04166667
## [25] 0.04000000 0.03846154 0.03703704 0.03571429 0.03448276 0.03333333
## [31] 0.03225806 0.03125000 0.03030303 0.02941176 0.02857143 0.02777778
## [37] 0.02702703 0.02631579 0.02564103 0.02500000 0.02439024 0.02380952
## [43] 0.02325581 0.02272727 0.02222222 0.02173913 0.02127660 0.02083333
## [49] 0.02040816 0.02000000
And do maths on them:
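The two results below match the mean and standard deviation of d:

```r
d <- 1/(1:50)   # as before
mean(d)   # average of the values
sd(d)     # standard deviation of the values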
## [1] 0.08998411
## [1] 0.1578087
This is important stuff. If we want to ask whether something is equal
to something else, we need to use the == operator, NOT
=. Try this:
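The comparison, element by element:

```r
x <- 1:10   # as before
x == 5      # element-wise test for equality
```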
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
you will see that all are marked FALSE apart from the
5th which is TRUE. If you want to know specifically which
element equals 5, use which:
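The which call:

```r
x <- 1:10
which(x == 5)   # index of the element(s) that are TRUE
```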
## [1] 5
And you’ll see it now says the 5th element of vector x equals 5. A few more:
Let’s make another vector y:
Finding the intersect is easy:
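The exact definition of y is not shown; any vector sharing 7 to 10 with x fits the output below, e.g.:

```r
x <- 1:10
y <- 7:16        # assumed definition of y
intersect(x, y)  # elements common to both vectors
```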
## [1] 7 8 9 10
What do you think this does?:
Let’s use the vectors c and d and plot them:
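A basic scatter plot, passing the two vectors directly:

```r
c <- 1:50
d <- 1/c
plot(c, d)   # the axes pick up the variable names, "c" and "d"
```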
Note the way the axes are labelled in the plot function.
Exercise: Call help(plot) in
the console and read about the other plot options available. Produce the
same plot as above, but this time as a line which is
coloured red. Also, label the axes \(c\) and \(d\) and give the plot a title.
This plot was done using base R. This means we used the plotting functions native, or built-in, to the R language. Plotting in R has been revamped a lot since ggplot was introduced, the aim being to make plotting more customisable. Let’s try the same plot using ggplot. First, install it:
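The package is on CRAN under the name ggplot2:

```r
install.packages("ggplot2")
```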
Now call the ggplot library:
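Load it into the session:

```r
library(ggplot2)
```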
The first thing we need to do is make a data.frame of the data by doing:
This makes a table. You can see the first few lines by doing:
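Assuming the data frame (name df assumed) was built from the c and d vectors, with the column names shown in the output below:

```r
c <- 1:50        # as defined earlier
d <- 1/c
df <- data.frame(col1 = c, col2 = d)
head(df)         # show the first few rows
```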
## col1 col2
## 1 1 1.0000000
## 2 2 0.5000000
## 3 3 0.3333333
## 4 4 0.2500000
## 5 5 0.2000000
Let’s make the plot:
We can even add the points:
and change the colour of each:
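Putting the three steps together — base plot, points layered on top, and colours — the ggplot2 code was presumably along these lines (data frame name and colour choices assumed):

```r
library(ggplot2)

# a line plot of col2 against col1, with coloured points layered on top
ggplot(df, aes(x = col1, y = col2)) +
  geom_line(colour = "blue") +
  geom_point(colour = "red")
```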
So ggplot is all about layering information onto plots, and this makes it very customisable. It definitely has a steeper learning curve, but it’s worth it if you produce lots of plots on a day-to-day basis. If you’re struggling with the idea, use this analogy of ggplots as a cake (Tanya Shapiro (@tanya_shapiro), Twitter):
Matrices are the most common data format bioinformaticians work with (microarray/RNA-seq data, for example). Let’s make one:
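A 10-row, 5-column matrix of zeros (variable name m assumed):

```r
m <- matrix(0, nrow = 10, ncol = 5)
m
```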
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 0 0 0
## [5,] 0 0 0 0 0
## [6,] 0 0 0 0 0
## [7,] 0 0 0 0 0
## [8,] 0 0 0 0 0
## [9,] 0 0 0 0 0
## [10,] 0 0 0 0 0
This will create a matrix filled with zeros. To transpose (flip) the
matrix we use t() (this will be important later!)
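The transpose call:

```r
m <- matrix(0, nrow = 10, ncol = 5)
t(m)   # rows become columns and vice versa
```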
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 0 0
We can also add names to the row and columns:
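Setting the dimension names (LETTERS is R’s built-in vector of capital letters):

```r
m <- matrix(0, nrow = 10, ncol = 5)
rownames(m) <- LETTERS[1:10]
colnames(m) <- c("cat", "dog", "pig", "cow", "chicken")
m
```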
## cat dog pig cow chicken
## A 0 0 0 0 0
## B 0 0 0 0 0
## C 0 0 0 0 0
## D 0 0 0 0 0
## E 0 0 0 0 0
## F 0 0 0 0 0
## G 0 0 0 0 0
## H 0 0 0 0 0
## I 0 0 0 0 0
## J 0 0 0 0 0
Let’s make a matrix (and a vector) containing integer values so we can take a look at how subsetting works in R:
v <- 1:10
m <- t(matrix(1:50, ncol = 10, nrow = 5))
rownames(m) <- LETTERS[1:10]
colnames(m) <- c("cat", "dog", "pig", "cow", "chicken")
m
## cat dog pig cow chicken
## A 1 2 3 4 5
## B 6 7 8 9 10
## C 11 12 13 14 15
## D 16 17 18 19 20
## E 21 22 23 24 25
## F 26 27 28 29 30
## G 31 32 33 34 35
## H 36 37 38 39 40
## I 41 42 43 44 45
## J 46 47 48 49 50
We can access individual elements using square brackets
[]. Here are some examples:
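Matching the outputs below, the subsetting calls were presumably (v and m as defined above):

```r
v <- 1:10
m <- t(matrix(1:50, ncol = 10, nrow = 5))
rownames(m) <- LETTERS[1:10]
colnames(m) <- c("cat", "dog", "pig", "cow", "chicken")

v[c(7,1,5)]   # elements 7, 1 and 5 of the vector
m[1,]         # the first row
m[,3]         # the third column
m[8,2]        # a single element: row 8, column 2
m[3:7,4]      # rows 3 to 7 of column 4
```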
## [1] 7 1 5
## cat dog pig cow chicken
## 1 2 3 4 5
## A B C D E F G H I J
## 3 8 13 18 23 28 33 38 43 48
## [1] 37
## C D E F G
## 14 19 24 29 34
Note the c(7,1,5) where we subset vector v. c means combine, and it allows the elements in the parentheses to be collected into a single vector.
We can also address elements using the column and row names:
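For example (m as defined above):

```r
m <- t(matrix(1:50, ncol = 10, nrow = 5))
rownames(m) <- LETTERS[1:10]
colnames(m) <- c("cat", "dog", "pig", "cow", "chicken")

m["B",]                                  # row B
m["B","cow"]                             # a single element by name
m[c("F","J"), c("chicken","cat","pig")]  # several rows/columns, reordered
```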
## cat dog pig cow chicken
## 6 7 8 9 10
## [1] 9
## chicken cat pig
## F 30 26 28
## J 50 46 48
We often need to collect vectors and assemble them into a matrix.
This can be done using the rbind (row) and
cbind (column) functions:
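Matching the outputs below:

```r
v1 <- 1:10
v2 <- 101:110
rbind(v1, v2)   # stack the vectors as rows
cbind(v1, v2)   # stack the vectors as columns
```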
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## v1 1 2 3 4 5 6 7 8 9 10
## v2 101 102 103 104 105 106 107 108 109 110
## v1 v2
## [1,] 1 101
## [2,] 2 102
## [3,] 3 103
## [4,] 4 104
## [5,] 5 105
## [6,] 6 106
## [7,] 7 107
## [8,] 8 108
## [9,] 9 109
## [10,] 10 110
We will do a lot more work with matrices later, particularly mathematical operations.
So far we have talked about vectors and matrices. Often we want to collect these things and put them into one object under a single variable. For example:
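A list holding a character vector and a matrix of random normal values, matching the outputs further down (the list’s variable name, l, is assumed):

```r
l <- list(char = LETTERS[1:8], numb = matrix(rnorm(40), nrow = 8))
```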
You can see that each item is given a name (‘char’, ‘numb’) before it is put into the list. Each element can now be accessed via $:
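Assuming the list was stored as l:

```r
l <- list(char = LETTERS[1:8], numb = matrix(rnorm(40), nrow = 8))
l$char       # the character vector
l$numb       # the matrix
l$numb[1,]   # first row of the matrix
```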
## [1] "A" "B" "C" "D" "E" "F" "G" "H"
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0739463 -1.7360640 1.5116693 -0.5342028 0.5660760
## [2,] 0.9510137 -1.6779363 -0.7514769 0.2854680 -0.4626121
## [3,] -0.5922803 -2.3704785 -0.1807808 -0.3283688 0.9303043
## [4,] 1.1481838 2.2808569 1.0741713 -0.7689942 -1.7375805
## [5,] 1.0099163 0.4815679 1.3071015 -0.8105483 -0.3071693
## [6,] -0.1080959 1.0470289 1.6107538 -0.5996738 -1.2994476
## [7,] -2.1318033 -0.6550634 0.2903355 0.3471615 -0.6828171
## [8,] -2.3158137 0.1428056 0.3078578 -0.5628889 -0.5318038
## [1] 1.0739463 -1.7360640 1.5116693 -0.5342028 0.5660760
Another way of doing the above is:
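Double square brackets with the element name (again assuming the list is called l):

```r
l <- list(char = LETTERS[1:8], numb = matrix(rnorm(40), nrow = 8))
l[["char"]]
l[["numb"]]
l[["numb"]][1,]
```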
## [1] "A" "B" "C" "D" "E" "F" "G" "H"
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0739463 -1.7360640 1.5116693 -0.5342028 0.5660760
## [2,] 0.9510137 -1.6779363 -0.7514769 0.2854680 -0.4626121
## [3,] -0.5922803 -2.3704785 -0.1807808 -0.3283688 0.9303043
## [4,] 1.1481838 2.2808569 1.0741713 -0.7689942 -1.7375805
## [5,] 1.0099163 0.4815679 1.3071015 -0.8105483 -0.3071693
## [6,] -0.1080959 1.0470289 1.6107538 -0.5996738 -1.2994476
## [7,] -2.1318033 -0.6550634 0.2903355 0.3471615 -0.6828171
## [8,] -2.3158137 0.1428056 0.3078578 -0.5628889 -0.5318038
## [1] 1.0739463 -1.7360640 1.5116693 -0.5342028 0.5660760
Lists are OK, but they can become like the Wild West, where things are thrown in with little organisation. They are fine for small things, but big data shouldn’t be stored in them, nor should they become the basis for a large project. You will see why later.
Data frames can be thought of as a list where every column has the same length. They allow you to mix different classes (characters, numerics) in the same container. For example, the iris data which comes with R:
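The row numbers in the output below jump around, so the chunk presumably drew a random sample of rows, something like:

```r
iris[sample(1:nrow(iris), 10), ]   # 10 rows picked at random
```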
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 150 5.9 3.0 5.1 1.8 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 11 5.4 3.7 1.5 0.2 setosa
## 104 6.3 2.9 5.6 1.8 virginica
## 9 4.4 2.9 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 120 6.0 2.2 5.0 1.5 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 66 6.7 3.1 4.4 1.4 versicolor
## 36 5.0 3.2 1.2 0.2 setosa
You see the first few columns are numeric, and the last contains character strings. Data frames are the primary input into things like the tidyverse, and their columns can be addressed by name just as you would in a list:
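For example:

```r
iris$Sepal.Length
```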
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
## [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
## [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
## [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
## [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
## [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
We can also break a data frame up on attributes and store the contents in a list (which we have already discussed). For example, by species:
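Using split, with the list’s names matching the output below (the variable name iris.split is assumed):

```r
iris.split <- split(iris, iris$Species)  # one data frame per species
names(iris.split)
```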
## [1] "setosa" "versicolor" "virginica"
You have to get the data into R before you can analyse it (this helps a lot). R has many useful functions to do this, so now we can take our first look at some expression data. Download this file (https://github.com/shambam/R_programming_1/blob/main/Mouse_HSPC_reduced.txt) and save it to your current working directory.
Exercise: Open the file in Excel or
something to see how it looks, and then call
help(read.delim) in your console. Try to work out how the
file you are looking at could be read into R using this function.
This is how I would do it:
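A likely call, assuming the first column of the file holds the gene names (the later outputs show gene names as row names and cells as column names):

```r
hspc.data <- read.delim("Mouse_HSPC_reduced.txt", header = TRUE, row.names = 1)
```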
We can now look at a few aspects of the data:
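Matching the outputs below:

```r
colnames(hspc.data)   # the cell (sample) names
nrow(hspc.data)       # number of genes
ncol(hspc.data)       # number of cells
dim(hspc.data)        # both dimensions at once
names(hspc.data)      # for a data.frame, the same as colnames
```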
## [1] "LTHSC.1" "LTHSC.2" "LTHSC.3" "LTHSC.4" "LTHSC.5" "LTHSC.6"
## [7] "LTHSC.7" "LTHSC.8" "LTHSC.9" "LTHSC.10" "LTHSC.11" "LTHSC.12"
## [13] "LTHSC.13" "LTHSC.14" "LTHSC.15" "LTHSC.16" "LTHSC.17" "LTHSC.18"
## [19] "LTHSC.19" "LTHSC.20" "LTHSC.21" "LTHSC.22" "LTHSC.23" "LTHSC.24"
## [25] "LTHSC.25" "LTHSC.26" "LTHSC.27" "LTHSC.28" "LTHSC.29" "LTHSC.30"
## [31] "LTHSC.31" "LTHSC.32" "LTHSC.33" "LTHSC.34" "LTHSC.35" "LTHSC.36"
## [37] "LTHSC.37" "LTHSC.38" "LTHSC.39" "LTHSC.40" "LTHSC.41" "LTHSC.42"
## [43] "LTHSC.43" "LTHSC.44" "LTHSC.45" "LTHSC.46" "LTHSC.47" "LTHSC.48"
## [49] "LTHSC.49" "LTHSC.50" "MEP.1" "MEP.2" "MEP.3" "MEP.4"
## [55] "MEP.5" "MEP.6" "MEP.7" "MEP.8" "MEP.9" "MEP.10"
## [61] "MEP.11" "MEP.12" "MEP.13" "MEP.14" "MEP.15" "MEP.16"
## [67] "MEP.17" "MEP.18" "MEP.19" "MEP.20" "MEP.21" "MEP.22"
## [73] "MEP.23" "MEP.24" "MEP.25" "MEP.26" "MEP.27" "MEP.28"
## [79] "MEP.29" "MEP.30" "MEP.31" "MEP.32" "MEP.33" "MEP.34"
## [85] "MEP.35" "MEP.36" "MEP.37" "MEP.38" "MEP.39" "MEP.40"
## [91] "MEP.41" "MEP.42" "MEP.43" "MEP.44" "MEP.45" "MEP.46"
## [97] "MEP.47" "MEP.48" "MEP.49" "MEP.50" "GMP.1" "GMP.2"
## [103] "GMP.3" "GMP.4" "GMP.5" "GMP.6" "GMP.7" "GMP.8"
## [109] "GMP.9" "GMP.10" "GMP.11" "GMP.12" "GMP.13" "GMP.14"
## [115] "GMP.15" "GMP.16" "GMP.17" "GMP.18" "GMP.19" "GMP.20"
## [121] "GMP.21" "GMP.22" "GMP.23" "GMP.24" "GMP.25" "GMP.26"
## [127] "GMP.27" "GMP.28" "GMP.29" "GMP.30" "GMP.31" "GMP.32"
## [133] "GMP.33" "GMP.34" "GMP.35" "GMP.36" "GMP.37" "GMP.38"
## [139] "GMP.39" "GMP.40" "GMP.41" "GMP.42" "GMP.43" "GMP.44"
## [145] "GMP.45" "GMP.46" "GMP.47" "GMP.48" "GMP.49" "GMP.50"
## [1] 4170
## [1] 150
## [1] 4170 150
## [1] "LTHSC.1" "LTHSC.2" "LTHSC.3" "LTHSC.4" "LTHSC.5" "LTHSC.6"
## [7] "LTHSC.7" "LTHSC.8" "LTHSC.9" "LTHSC.10" "LTHSC.11" "LTHSC.12"
## [13] "LTHSC.13" "LTHSC.14" "LTHSC.15" "LTHSC.16" "LTHSC.17" "LTHSC.18"
## [19] "LTHSC.19" "LTHSC.20" "LTHSC.21" "LTHSC.22" "LTHSC.23" "LTHSC.24"
## [25] "LTHSC.25" "LTHSC.26" "LTHSC.27" "LTHSC.28" "LTHSC.29" "LTHSC.30"
## [31] "LTHSC.31" "LTHSC.32" "LTHSC.33" "LTHSC.34" "LTHSC.35" "LTHSC.36"
## [37] "LTHSC.37" "LTHSC.38" "LTHSC.39" "LTHSC.40" "LTHSC.41" "LTHSC.42"
## [43] "LTHSC.43" "LTHSC.44" "LTHSC.45" "LTHSC.46" "LTHSC.47" "LTHSC.48"
## [49] "LTHSC.49" "LTHSC.50" "MEP.1" "MEP.2" "MEP.3" "MEP.4"
## [55] "MEP.5" "MEP.6" "MEP.7" "MEP.8" "MEP.9" "MEP.10"
## [61] "MEP.11" "MEP.12" "MEP.13" "MEP.14" "MEP.15" "MEP.16"
## [67] "MEP.17" "MEP.18" "MEP.19" "MEP.20" "MEP.21" "MEP.22"
## [73] "MEP.23" "MEP.24" "MEP.25" "MEP.26" "MEP.27" "MEP.28"
## [79] "MEP.29" "MEP.30" "MEP.31" "MEP.32" "MEP.33" "MEP.34"
## [85] "MEP.35" "MEP.36" "MEP.37" "MEP.38" "MEP.39" "MEP.40"
## [91] "MEP.41" "MEP.42" "MEP.43" "MEP.44" "MEP.45" "MEP.46"
## [97] "MEP.47" "MEP.48" "MEP.49" "MEP.50" "GMP.1" "GMP.2"
## [103] "GMP.3" "GMP.4" "GMP.5" "GMP.6" "GMP.7" "GMP.8"
## [109] "GMP.9" "GMP.10" "GMP.11" "GMP.12" "GMP.13" "GMP.14"
## [115] "GMP.15" "GMP.16" "GMP.17" "GMP.18" "GMP.19" "GMP.20"
## [121] "GMP.21" "GMP.22" "GMP.23" "GMP.24" "GMP.25" "GMP.26"
## [127] "GMP.27" "GMP.28" "GMP.29" "GMP.30" "GMP.31" "GMP.32"
## [133] "GMP.33" "GMP.34" "GMP.35" "GMP.36" "GMP.37" "GMP.38"
## [139] "GMP.39" "GMP.40" "GMP.41" "GMP.42" "GMP.43" "GMP.44"
## [145] "GMP.45" "GMP.46" "GMP.47" "GMP.48" "GMP.49" "GMP.50"
Exercise: Using subsetting we learnt about
earlier, make three new matrices called lthsc,
mep and gmp to separate the cell types shown
in the headings. For this look at the help page for a function called
grep.
lthsc <- hspc.data[, grep("LTHSC", colnames(hspc.data))]
mep <- hspc.data[, grep("MEP", colnames(hspc.data))]
gmp <- hspc.data[, grep("GMP", colnames(hspc.data))]
To write a table use the write.table function:
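For example (filename and arguments assumed):

```r
write.table(lthsc, file = "lthsc.txt", sep = "\t", quote = FALSE)
```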
Exercise: Write out the data for the MEP and GMP data into two files.
The data tables we have now are in the form of a data.frame. Try:
## [1] "data.frame"
This can be an awkward format for some operations so we can convert it to a simple matrix first:
hspc.data <- as.matrix(hspc.data)
lthsc <- as.matrix(lthsc)
mep <- as.matrix(mep)
gmp <- as.matrix(gmp)
Try this now:
## [1] "matrix" "array"
This is where things get more interesting and we start to feel like proper programmers. Now that we have these datasets loaded in R, we can use them to learn about flow control and some basic mathematical functions. We are going to do a few things the “long way” so you get the idea of how flow control works, and then we’ll look at some shortcuts.
Flow control is how multi-step processes are carried out. In the example below we print out the numbers 1 to 10:
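The loop:

```r
for(i in 1:10){
  print(i)
}
```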
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
To translate this code, it simply says for every integer from 1 to 10, print this value to the screen.
Exercises: Using the example above, print the first 10 lines of lthsc in a for loop. Then do the same for mep, lines 1 to 50.

An important point regarding for loops is that any processes/calculations occurring within the loop will stay in the loop. If data generated within a loop has to be retained, we need to create a container to “fill up” while the loop is being carried out.
The container vec is initialised outside the loop, and
then populated by concatenating to it after every iteration of the
loop.
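A sketch of that pattern (the actual computation in the original chunk is assumed):

```r
vec <- NULL                 # initialise an empty container outside the loop
for(i in 1:10){
  vec <- c(vec, i * 2)      # concatenate each new result onto the container
}
vec
```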
Exercise: Initialise an empty container,
and for gmp, calculate the mean of each row (gene), and
store the results in the container you made.
Functions are chunks of code that execute several lines of code to perform a task. Once you have a few lines of useful code that you want to apply repeatedly, a function is a nice way to wrap them up so they can be used quickly when needed. Let’s turn the code you wrote in the previous exercise into a function where we also calculate the variance for each gene.
calc.mean.and.sd <- function(mat){
mn <- NULL
vr <- NULL
for(i in 1:nrow(mat)){
mn <- c(mn, mean(mat[i,]))
vr <- c(vr, var(mat[i,]))
}
res <- list(mns = mn, vars = vr)
res # the last line in a function is what the function will return. You can also be more explicit and say return(res)
}
You can see a loop is started, and the output from each iteration is put into the variables mn and vr. These are then put into a list which is returned at the end. By putting this code into a function we can now calculate the means and variances of any matrix. For example, gmp:
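Calling the function on the gmp matrix (the result’s variable name is assumed):

```r
gmp.stats <- calc.mean.and.sd(gmp)
```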
Functions can also work with built-in conditions:
animal.maths <- function(value1, value2, animal = c("pig", "cow")){
if(animal == "pig"){
print(value1/value2)
}
if(animal == "cow"){
print(value1*value2)
}
}
animal.maths(5, 5, "pig")
## [1] 1
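The second output below (25) comes from the multiplication branch:

```r
animal.maths(5, 5, "cow")
```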
## [1] 25
The above can be simplified a bit further since there are only two
options pig and cow by using the
else statement:
animal.maths <- function(value1,value2,animal=c("pig", "cow")){
if(animal == "pig"){
print(value1/value2)
}else{
print(value1*value2)
}
}
animal.maths(5, 5, "pig")
## [1] 1
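And again the cow case, now handled by the else branch:

```r
animal.maths(5, 5, "cow")
```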
## [1] 25
In the case above we’re only concerned if one of the arguments is
pig, anything else goes:
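Any value other than "pig" falls through to the else branch — the animal used here is just an example:

```r
animal.maths(5, 5, "sheep")   # not pig, so the else branch multiplies
```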
## [1] 25
These functions can now be “banked” for use whenever they are needed
(probably not animal.maths to be fair). However,
you should avoid using for loops where possible, since R has some built-in functions that are much quicker and tidier. Let’s look at that now.
apply is a commonly used function in R to speed up matrix calculations. For example, to calculate the means of a matrix we can do this:
lthsc.row.mn <- apply(lthsc, 1, mean) # means of rows
lthsc.col.mn <- apply(lthsc, 2, mean) # means of columns
The format for the function is therefore: (1) the matrix, (2) the direction in which you would like to apply a function (1 = rows, 2 = columns), and (3) the function to be applied.
Exercise: Use apply to calculate row and column totals and deviations for the expression matrices we have loaded.
Your own functions can also be used with apply when used
as the 3rd argument. Example:
example.func <- function(v){
val <- (mean(v)*sd(v))/sum(v) ## This is a nonsense operation.
val
}
ex.apply <- apply(mep, 1, example.func)
Let’s use the apply function to get the top 500 most variable genes in our HSPC dataset:
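One way to do this — the matrix name hspc.var matches later references, the intermediate name is assumed:

```r
gene.vars <- apply(hspc.data, 1, var)                             # variance of each gene
hspc.var <- hspc.data[order(gene.vars, decreasing = TRUE)[1:500], ]  # keep the top 500
```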
Let’s stick with our expression data and cluster it. Doing so, we’ll learn more of the language, and some of the fundamental maths that runs underneath. Let’s take a look at the range of the data, i.e. the lowest and highest values in the matrix of variable genes we just made.
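Using the range function:

```r
range(hspc.var)   # lowest and highest values in the matrix
```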
## [1] 0.0000 15.7097
For some operations (such as making heatmaps) the data needs to be z-score normalised (scaled) first. When we scale data, each gene (row) is standardised so that its mean = 0 and its sd = 1. Specifically, for a gene g in the i-th row:

\[Z_i=\frac{g_i-\hat{g}}{\sigma_g}\]

which means for every expression value of the gene (\(g_i\)) we subtract the mean expression of the gene (\(\hat{g}\)) and divide by the standard deviation of the expression of the gene (\(\sigma_g\)).
Exercise: write a function called
zscore which will take a single vector of values and scale
them. When you have done this, apply this to the
hspc.var matrix to scale all rows and call it
hspc.zs.
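One possible solution (note that apply over rows returns its results as columns, hence the transpose):

```r
zscore <- function(v){
  (v - mean(v)) / sd(v)
}

# apply() over rows returns the scaled genes as columns, so t() puts
# the genes back into the rows
hspc.zs <- t(apply(hspc.var, 1, zscore))
```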
Now take a look at the first row of the normalised data. Call
nrow on the matrix. Does it look right?
We can see now the data has been centralised around 0.
Exercise: Do some Googling and see if there
is a built in function in R that will calculate zscores for you. Use it
to zscore your data and save the output in hspc.zs.v2.
Let’s make sure the same thing has been returned by comparing the first row:
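The message below about names is what all.equal reports when the values match but one object carries a names attribute the other lacks:

```r
all.equal(hspc.zs[1,], hspc.zs.v2[1,])
```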
## [1] "names for target but not for current"
All the steps before have been leading to this, clustering the data and making heatmaps - a staple method in bioinformatics.
Clustering is one of the most common visualisation techniques for gene expression data. Here we will learn how to do some basic histograms/heatmaps and plotting.
R has many ways to do this, and many packages have been written
specifically for expression data. We are not going to use these for now,
but concentrate on the basic underlying functions that do the maths. For
example, the gplots package uses the hclust
function which is provided by R. So we will use hclust for
now.
To use hclust we need to provide a distance matrix. This
is done using the dist function:
We then cluster using hclust:
Plot the dendrogram:
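The three steps together — the object names hspc.dst and hspc.hc are assumed, mirroring the cell-level names used below:

```r
hspc.dst <- dist(hspc.zs)    # Euclidean distance between genes (rows)
hspc.hc <- hclust(hspc.dst)  # hierarchical clustering of the distance matrix
plot(hspc.hc)                # draw the dendrogram
```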
You’ll see from this that what we have clustered are the genes. If you
want to cluster the cells then you need to transpose the matrix using
t():
hspc.dst.cells <- dist(t(hspc.zs)) #transpose the matrix here!
hspc.hc.cells <- hclust(hspc.dst.cells)
plot(hspc.hc.cells)
We can see this is pretty much useless. It is far too compact and
doesn’t really tell us anything. What we would like is to make a heatmap
where the genes and samples are clustered. To do this we need to
retrieve some information created by hclust
Call names to see which information is available in the
newly created object:
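For the cell-level clustering:

```r
names(hspc.hc.cells)
```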
## [1] "merge" "height" "order" "labels" "method"
## [6] "call" "dist.method"
What we need here is the component called order. We can get this using the $ operator.
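For example:

```r
hspc.hc.cells$order
```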
## [1] 51 8 80 95 54 85 91 59 76 87 97 52 66 62 63 82 55 64
## [19] 61 75 100 58 60 94 79 30 70 56 86 84 92 106 78 71 73 74
## [37] 81 99 68 88 53 69 131 129 107 102 132 145 113 126 109 115 133 105
## [55] 117 116 141 147 120 104 118 111 101 143 108 114 2 128 148 140 149 134
## [73] 123 122 150 124 125 119 144 103 112 138 121 130 137 135 139 146 25 27
## [91] 21 6 31 16 13 15 26 50 7 38 136 33 46 11 37 20 23 34
## [109] 17 14 12 18 1 49 10 29 43 32 35 3 9 28 39 40 42 47
## [127] 44 36 45 41 19 48 5 24 142 110 4 98 22 65 127 90 67 72
## [145] 83 77 57 89 93 96
This is the order the cells appear in, from left to right, when you plotted the dendrogram of cells just before. We use this to reorder the z-scored matrix, along with the gene order:
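Reordering both dimensions at once (the reordered matrix name and the gene-level hclust object name, hspc.hc, are assumed):

```r
hspc.ro <- hspc.zs[hspc.hc$order, hspc.hc.cells$order]
```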
To make a heatmap of the data call image:
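The image call on the reordered matrix:

```r
image(hspc.ro)
```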
Ok, this doesn’t look like it should! The matrix is the wrong way round, the colours aren’t right, and there are no labels. One downside to R is that getting all this done takes time and knowledge of R’s plotting capabilities. Thankfully people have already done this and put the code into convenient functions/packages for people to install and use.
Exercise: Install the pheatmap
package.
Load it first:
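As with ggplot2:

```r
library(pheatmap)
```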
We can now use the pheatmap function that the package
provides:
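With defaults, a single call does the clustering and drawing:

```r
pheatmap(hspc.zs)
```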
The pheatmap function uses hclust to cluster the genes
and cells and reorders the matrix according to both. Lets output this to
a file:
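The png/dev.off pattern (the filename here is hypothetical):

```r
png("hspc_heatmap.png")   # open a png graphics device
pheatmap(hspc.zs)         # draw the heatmap into the file
dev.off()                 # close and finalise the file
```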
## png
## 3
The file is opened, the plot is made, and dev.off()
closes and finalises the file (i.e. nothing more can be written to
it).
Exercise: call help(pheatmap)
and see what options are available. Play with the options to see what
they do.
Let’s pimp the heatmap by stripping it of useless stuff and adding a colour bar (Raquel’s code):
library(stringr) # install this first if you don't have it.
coldata <- data.frame(samples = colnames(hspc.zs)) # make a data frame
coldata$type <- sapply(str_split(colnames(hspc.zs), "[[.]]"), `[[`, 1) #split the names around the "." and use this to define the cell type.
rownames(coldata) <- coldata$samples # add rownames to the data frame
pheatmap(hspc.zs,
annotation_col = coldata[,c("type"),drop=F],
show_colnames = F,
show_rownames = F, treeheight_row = 0,
treeheight_col = 0)
Beautiful!
Clustering is pretty pointless if you can’t define groups and get to
the gene names. First we need to capture the output from
pheatmap as a variable:
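Assigning the return value works like any other function call (the extra options shown are assumed):

```r
hspc.clust <- pheatmap(hspc.zs, show_rownames = FALSE, show_colnames = FALSE)
```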
Let’s take a look at the contents of
hspc.clust:
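Again using names:

```r
names(hspc.clust)
```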
## [1] "tree_row" "tree_col" "kmeans" "gtable"
What we want is the information contained within the hclust object in
tree_row. We get this by treating it like a list:
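The two outputs below correspond to the hclust object itself and its order component:

```r
hspc.clust$tree_row          # the hclust object for the genes
hspc.clust$tree_row$order    # the plotting order of the genes
```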
##
## Call:
## hclust(d = d, method = method)
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 500
## [1] 164 413 306 314 293 490 387 491 425 476 276 489 434 458 309 360 315 467
## [19] 324 184 340 372 405 385 469 205 127 354 246 429 249 465 423 485 454 192
## [37] 368 399 418 400 492 254 449 464 494 123 496 199 313 461 327 272 193 495
## [55] 367 395 379 497 126 478 353 150 347 444 37 43 63 76 39 56 83 248
## [73] 380 237 312 65 204 73 394 303 439 374 381 377 180 200 220 427 136 455
## [91] 487 189 233 432 185 145 167 463 302 247 451 252 337 97 228 155 144 258
## [109] 28 107 152 383 195 142 138 271 161 325 176 277 113 472 398 173 243 244
## [127] 373 250 296 417 488 190 384 234 364 131 460 80 154 263 351 175 159 259
## [145] 334 281 236 441 361 414 401 477 500 121 280 285 257 422 471 301 198 338
## [163] 279 284 346 445 238 355 483 270 298 402 158 262 151 240 171 209 420 98
## [181] 299 57 67 331 147 53 69 124 330 288 349 61 102 129 265 305 153 29
## [199] 187 106 72 99 78 40 108 55 197 235 256 357 49 66 307 160 484 468
## [217] 88 437 282 297 169 239 320 275 177 479 386 216 392 225 462 378 415 179
## [235] 345 352 470 85 409 304 486 370 457 212 62 90 231 323 70 203 318 82
## [253] 382 410 52 130 111 115 213 294 122 103 41 19 4 13 14 27 58 10
## [271] 174 94 46 35 8 24 33 165 132 51 30 31 128 12 34 105 322 369
## [289] 214 393 141 75 223 343 101 391 81 362 140 438 117 396 201 137 311 79
## [307] 435 245 266 283 433 316 371 450 388 499 224 32 229 269 92 221 366 100
## [325] 498 211 426 133 226 242 77 333 143 162 166 264 125 317 109 230 448 16
## [343] 59 157 342 84 206 480 222 64 114 290 219 411 475 170 363 407 202 278
## [361] 253 421 207 344 348 38 186 86 11 9 18 146 319 328 459 112 452 210
## [379] 96 329 188 341 365 68 178 474 48 291 183 45 119 54 134 74 7 215
## [397] 1 5 36 104 20 15 2 3 21 17 23 135 218 332 408 289 292 268
## [415] 404 326 390 191 308 44 358 419 406 456 403 440 397 466 217 267 300 93
## [433] 181 91 116 22 50 412 163 26 6 156 71 232 251 182 442 194 87 208
## [451] 227 339 149 473 273 389 376 120 42 60 25 89 424 443 321 172 428 47
## [469] 95 196 446 335 482 295 493 350 118 430 241 168 139 148 375 431 356 447
## [487] 453 260 286 481 110 255 287 261 416 436 310 336 274 359
Let’s say that we want to split the genes into 5 clusters. We can call the cutree function on an hclust object to do this:
gene.clusters <- cutree(hspc.clust$tree_row, k = 5)
gene.clusters[1:20] # shows the results for the first 20 genes.
## Elane Ctsg Mpo Car1 Ms4a3
## 1 1 1 2 1
## Mpl Ly6c2 Klf1 Nkg7 Ces2g
## 3 1 2 1 2
## Cst7 Ermap Aqp1 Gm15915 Clec12a
## 1 2 2 2 1
## Prtn3 Rab44 Plac8 Gata1 F630028O10Rik
## 1 1 1 2 1
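The cluster sizes below come from tabulating the assignments:

```r
table(gene.clusters)   # how many genes fall in each cluster
```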
## gene.clusters
## 1 2 3 4 5
## 90 296 53 23 38
Let’s isolate all the genes belonging to cluster 1 using the which function:
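The result’s variable name, cluster1, is assumed:

```r
cluster1 <- which(gene.clusters == 1)
cluster1   # a named vector: gene names with their row positions
```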
## Elane Ctsg Mpo Ms4a3 Ly6c2
## 1 2 3 5 7
## Nkg7 Cst7 Clec12a Prtn3 Rab44
## 9 11 15 16 17
## Plac8 F630028O10Rik Atp8b4 Tyrobp Anxa3
## 18 20 21 23 36
## Fcgr3 Ccl9 Alas1 Sell Emb
## 38 45 48 54 59
## Lyz2 Fes Hk3 Serpinf1 Tbc1d10c
## 64 68 74 77 84
## Adgrg3 Gstm1 Napsa Wls Igsf6
## 86 92 96 100 104
## Unc93b1 Fcer1g Tmem176a Cd53 Cd93
## 109 112 114 119 125
## Tsc22d3 BC035044 Emilin1 Anxa1 Spns3
## 133 134 135 143 146
## Gm8995 Sptbn1 Rab32 Gmpr Myo1g
## 157 162 166 170 178
## Serpinb1a Hdc Casp1 Klhl6 Anxa5
## 183 186 188 202 206
## Lgals1 Adssl1 Asah1 Hp Tcirg1
## 207 210 211 215 218
## Ifi203 Cnn2 Fkbp1a Cers5 Tmem173
## 219 221 222 226 230
## Ctsc Mapkapk3 Tcn2 Stap1 Tor3a
## 242 253 264 269 278
## Tmem176b Ms4a6c Tfec Vim Igfbp4
## 290 291 317 319 328
## Lbp Ppic Hexa Irf1 Csf2rb
## 329 333 341 342 344
## Fcgr2b Cpa3 Anxa2 Plcb2 Rasal3
## 348 363 365 366 407
## Gpr171 Idh1 Pak1 Lcp1 Parp8
## 411 421 426 448 452
## Pstpip1 Rgs2 Fam189b Adgrl4 Elmo1
## 459 474 475 480 498
We can isolate these rows only from our hspc.zs matrix
as we did before:
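Subsetting by the gene names of cluster 1 (the new matrix name is assumed):

```r
hspc.c1 <- hspc.zs[names(cluster1), ]
```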
We can now see how these genes behave as a whole using a boxplot:
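One box per cell, summarising the cluster-1 genes:

```r
boxplot(hspc.c1)
```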
Something to always remember is that computer resources such as RAM and disk space should be used wisely. One should always aim to create fast and lean code, without compromising readability. Here is a small example where we make a vector from 1 to 50,000 using a loop:
values <- NULL # make an empty container to catch the values
for(i in 1:50000){
values <- c(values, i)
}
values[1:10]
## [1] 1 2 3 4 5 6 7 8 9 10
Let’s do this again, but this time measure how long it takes using the
system.time function:
values <- NULL # make an empty container to catch the values
system.time(for(i in 1:50000){ values <- c(values, i)})
## user system elapsed
## 3.204 0.000 3.205
## [1] 50000
OK, now the same thing again, but this time we will fill a pre-allocated vector rather than growing it:
values <- vector("numeric",100000) # make an empty container to catch the values
system.time(for(i in 1:100000){ values[i] <- i})

##    user  system elapsed
##   0.007   0.000   0.007

length(values)

## [1] 100000
Why do you think this was so much faster?
Fitting a linear model (or linear regression) is another fairly common task, and actually forms the basis of some of the more famous packages we use, such as DESeq2 and limma. Let's use our expression data to demonstrate this.
In our case we have 3 different sample groups:
## [1] "LTHSC.1" "LTHSC.2" "LTHSC.3" "LTHSC.4" "LTHSC.5" "LTHSC.6"
## [7] "LTHSC.7" "LTHSC.8" "LTHSC.9" "LTHSC.10" "LTHSC.11" "LTHSC.12"
## [13] "LTHSC.13" "LTHSC.14" "LTHSC.15" "LTHSC.16" "LTHSC.17" "LTHSC.18"
## [19] "LTHSC.19" "LTHSC.20" "LTHSC.21" "LTHSC.22" "LTHSC.23" "LTHSC.24"
## [25] "LTHSC.25" "LTHSC.26" "LTHSC.27" "LTHSC.28" "LTHSC.29" "LTHSC.30"
## [31] "LTHSC.31" "LTHSC.32" "LTHSC.33" "LTHSC.34" "LTHSC.35" "LTHSC.36"
## [37] "LTHSC.37" "LTHSC.38" "LTHSC.39" "LTHSC.40" "LTHSC.41" "LTHSC.42"
## [43] "LTHSC.43" "LTHSC.44" "LTHSC.45" "LTHSC.46" "LTHSC.47" "LTHSC.48"
## [49] "LTHSC.49" "LTHSC.50" "MEP.1" "MEP.2" "MEP.3" "MEP.4"
## [55] "MEP.5" "MEP.6" "MEP.7" "MEP.8" "MEP.9" "MEP.10"
## [61] "MEP.11" "MEP.12" "MEP.13" "MEP.14" "MEP.15" "MEP.16"
## [67] "MEP.17" "MEP.18" "MEP.19" "MEP.20" "MEP.21" "MEP.22"
## [73] "MEP.23" "MEP.24" "MEP.25" "MEP.26" "MEP.27" "MEP.28"
## [79] "MEP.29" "MEP.30" "MEP.31" "MEP.32" "MEP.33" "MEP.34"
## [85] "MEP.35" "MEP.36" "MEP.37" "MEP.38" "MEP.39" "MEP.40"
## [91] "MEP.41" "MEP.42" "MEP.43" "MEP.44" "MEP.45" "MEP.46"
## [97] "MEP.47" "MEP.48" "MEP.49" "MEP.50" "GMP.1" "GMP.2"
## [103] "GMP.3" "GMP.4" "GMP.5" "GMP.6" "GMP.7" "GMP.8"
## [109] "GMP.9" "GMP.10" "GMP.11" "GMP.12" "GMP.13" "GMP.14"
## [115] "GMP.15" "GMP.16" "GMP.17" "GMP.18" "GMP.19" "GMP.20"
## [121] "GMP.21" "GMP.22" "GMP.23" "GMP.24" "GMP.25" "GMP.26"
## [127] "GMP.27" "GMP.28" "GMP.29" "GMP.30" "GMP.31" "GMP.32"
## [133] "GMP.33" "GMP.34" "GMP.35" "GMP.36" "GMP.37" "GMP.38"
## [139] "GMP.39" "GMP.40" "GMP.41" "GMP.42" "GMP.43" "GMP.44"
## [145] "GMP.45" "GMP.46" "GMP.47" "GMP.48" "GMP.49" "GMP.50"
We need to make a vector of factors from this info to specify the different groups.
Exercise: Install the stringr package and call it in your session.
samples <- colnames(hspc.data)
groups <- gsub("\\..*", "", samples) # This will make a new vector of names but remove the ".X" leaving just what type each sample belongs to
groups

## [1] "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC"
## [10] "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC"
## [19] "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC"
## [28] "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC"
## [37] "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC"
## [46] "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "MEP" "MEP" "MEP" "MEP"
## [55] "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP"
## [64] "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP"
## [73] "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP"
## [82] "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP"
## [91] "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP" "MEP"
## [100] "MEP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP"
## [109] "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP"
## [118] "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP"
## [127] "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP"
## [136] "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP" "GMP"
## [145] "GMP" "GMP" "GMP" "GMP" "GMP" "GMP"
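The output below comes from a call along these lines (the formula is echoed back in the Call: line; the variable name fit is my own):

```r
fit <- lm(hspc.data[1, ] ~ -1 + groups)  # -1 drops the intercept, giving one coefficient (mean) per group
fit
```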
##
## Call:
## lm(formula = hspc.data[1, ] ~ -1 + groups)
##
## Coefficients:
## groupsGMP groupsLTHSC groupsMEP
## 2.798 2.379 3.402
Let's call anova on this:
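Assuming the fitted model from above is stored in a variable called fit, the call would be:

```r
ano <- anova(fit)  # ANOVA table for the fitted linear model
ano
```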
## Analysis of Variance Table
##
## Response: hspc.data[1, ]
## Df Sum Sq Mean Sq F value Pr(>F)
## groups 3 1252.9 417.62 47.399 < 2.2e-16 ***
## Residuals 147 1295.2 8.81
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As you can see, the groups term has a significant p-value, suggesting this gene is differentially expressed.
Exercise: Take a look at the ano object and see how to isolate just the p-value from all this info.
Exercise: Make a function called do.anova which takes a vector v and a vector groups, performs linear modelling and ANOVA, and returns just the p-value.
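One possible solution (a sketch, not the only way): fit the no-intercept model, run the ANOVA, and pull the p-value out of the first row of the table.

```r
do.anova <- function(v, groups){
  fit <- lm(v ~ -1 + groups)   # one mean per group, no intercept
  ano <- anova(fit)
  ano[["Pr(>F)"]][1]           # row 1 of the ANOVA table is the groups term
}
```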
We can now apply this to all of our data:
ps <- apply(hspc.data, MARGIN = 1, FUN = function(x) do.anova(x, groups) )
p.adj <- p.adjust(ps, "hochberg")

Get the top 500 genes:
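One way to grab them, assuming "top" means the 500 smallest adjusted p-values (sig.genes is the name the heatmap call below expects):

```r
sig.genes <- names(sort(p.adj))[1:500]  # sort ascending, keep the 500 smallest
```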
Heatmap them!
pheatmap(hspc.data[sig.genes,],scale="row",
show_colnames = F,
show_rownames = F,
treeheight_row = 0,
treeheight_col = 0)

There are a few rogue values in the data which is why it looks washed out. Let's set a manual range:
library(RColorBrewer)
breaksList = seq(-3, 3, by = 0.2)
pheatmap(hspc.data[sig.genes,], scale = "row", breaks = breaksList,
         color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(breaksList)))

We're going to program our first algorithm together! K-means clustering is a common way to cluster expression data, and the algorithm is actually pretty simple and a great way to learn how to think programmatically. The data we'll use comes from yeast, where the expression of 256 genes has been measured as a synchronised population goes through two divisions. It's a nice, small dataset with clear clusters.
There is a file called "Spellman_Yeast_Cell_Cycle.tsv" (https://github.com/shambam/R_programming_1/blob/main/Spellman_Yeast_Cell_Cycle.tsv). Load it into a variable called ycc and convert it to a matrix.
ycc <- read.delim("Spellman_Yeast_Cell_Cycle.tsv", row.names = 1, header = T, sep = "\t")
ycc <- as.matrix(ycc)

The data has already been z-scored, so we need to learn how to calculate the Euclidean distance between two genes (\(g^{i}\) and \(g^{j}\)), each a vector of expression values. We do this using the formula:
\[d_{ij}=\sqrt{\sum{(g^{i}_t-g^{j}_t)^2}}\]
Write a function called e.dist which takes two vectors v1 and v2 and calculates the distance between them. In our data, these two vectors (v1 and v2) are the expression values of two genes across the different time points measured in the yeast cell cycle.
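One possible e.dist, translating the formula above directly:

```r
e.dist <- function(v1, v2){
  sqrt(sum((v1 - v2)^2))  # element-wise differences, squared, summed, then square-rooted
}
e.dist(c(0, 0), c(3, 4))  # the classic 3-4-5 triangle: returns 5
```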
The kmeans algorithm needs the user to say how many clusters are being searched for, so in this case we’ll say 8.
We’ll now work through this problem together and crowd source a solution.
centroids <- ycc[sample(1:256, 8, replace=F),] # pick 8 random genes as the initial centroids. We are sampling 8 row numbers (since we want 8 centroids) out of the vector 1:256, which represents the genes in ycc. Check that we indeed have 256 genes in ycc!
dists <- matrix(0, nrow = 256, ncol = 8) # make an empty matrix to fill with distances.
clusters <- NULL # make an empty variable to catch the clusters in the loop below
for(iteration in 1:100){ # 100 iterations
for(gene in 1:256){
for(k in 1:8){
dists[gene, k] <- e.dist(ycc[gene,], centroids[k,]) # for each gene calculate the distance to each centroid (k).
}
}
clusters <- apply(dists, 1, which.min) # assign a cluster according to which centroid is nearest
for(k in 1:8){
centroids[k,] <- apply(ycc[which(clusters == k),], 2, mean) # define new centroids.
}
}

Exercise: Plot your clusters using base R or ggplot2. Dealer's choice.
ggplot2:
library(reshape2)
library(ggplot2)
ycc.cl <- cbind(as.factor(clusters), ycc)
colnames(ycc.cl)[1] <- c("Cluster")
ycc.df <- data.frame(gene = rownames(ycc.cl), ycc.cl, row.names = NULL)
ytm <- melt(ycc.df, c("gene", "Cluster"))
ggplot(ytm, aes(x = variable, y = value, group = gene)) + geom_line() + facet_wrap(~Cluster, ncol = 3)

Raquel's ggplot2 method:
ycc_df <- as.data.frame(ycc)
ycc_df$clusters <- clusters
library(reshape2)
library(ggplot2)
ycc_df$gene <- rownames(ycc_df)
ycc_df_melt <- melt(ycc_df, c("clusters", "gene"))
ycc_df_melt$clusters <- as.factor(ycc_df_melt$clusters)
ggplot(data = ycc_df_melt, aes(x=variable, y=value, group = gene, colour=clusters)) +
geom_line() +
facet_wrap(~clusters, scales = "free_x") +
theme_classic() +
labs(x="Time", y="Expression")

base R:
par(mfrow=c(2, 4))
for(k in 1:8){
ycc.c <- ycc[which(clusters == k),]
plot(ycc.c[1,],ty="l",ylim = range(ycc.c))
apply(ycc.c, 1, lines)
}

Improvements

Exercise: Write your own function, my.kmeans, which takes the data, any number of clusters K, and a user-defined number of iterations as arguments. Return a list containing the cluster assignments and data using k-means clustering.

Don't forget to document your functions nicely so that there is no room for confusion.
my.kmeans <- function(data, nclus, iter){ # set up a function that takes some data, a number of clusters, and a max number of iterations
  centroids <- data[sample(1:nrow(data), nclus, replace=F),] # pick nclus random rows as the initial centroids
  dists <- matrix(0, nrow = nrow(data), ncol = nclus)
  clusters <- sample(1:nrow(centroids), nrow(data), replace=T) # make a random assignment of clusters to start off
  for(iteration in 1:iter){ # start the iterations
    for(gene in 1:nrow(data)){
      for(k in 1:nclus){
        dists[gene, k] <- e.dist(data[gene,], centroids[k,]) # for each gene, calculate the distance to each centroid (k)
      }
    }
    clusters.new <- apply(dists, 1, which.min) # get the new cluster assignments
    if(isTRUE(all.equal(clusters, clusters.new))){ # see if they equal the previous round's
      break # if so, stop
    } else { clusters <- clusters.new } # otherwise adopt the new assignments
    for(k in 1:nrow(centroids)){
      centroids[k,] <- apply(data[which(clusters == k),], 2, mean) # define new centroids
    }
    print(iteration)
  }
  clusters.new # return the clusters
}
clus.5 <- my.kmeans(ycc,5,100) # run it using 5 clusters
table(clus.5) # see how many genes belong to each cluster.

Your final project is going to combine everything you have learned over the last few days. You will tackle it in your own time over the next X weeks. It will be along the same lines as the k-means clustering we just did, but this time we're going to cluster the data using the Metropolis-Hastings algorithm. This is a simulated annealing process, so we need to learn how that works first.
Programming an optimisation problem is a great way of learning how to code. In this case we are going to tackle a minimisation problem. We can think of well-clustered data as having low energy: each cluster is tight, with little within-cluster variance. If we calculate the variance within each cluster and sum over all clusters, we get the total variance (energy) of the system. Let's go back to the equation we know that measures the distance between two genes \(i\) and \(j\) over \(t\) timepoints:
\[d_{ij}=\sqrt{\sum{(g^{i}_t-g^{j}_t)^2}}\]
To measure the compactness of a clustering, we sum the pairwise distances within each cluster \(k\), then sum over all \(K\) clusters and divide by \(K\):
\[ E(K)=\frac{1}{K}\sum^{K}_{k=1} \left[ \sum_{i\in C_{k}}\sum_{j\in C_{k}} d_{ij}\right] \]
For well-clustered data, \(E(K)\) should be as small as possible. Let's say we have 1000 genes and we want to partition them into 10 clusters. The number of possible partitions is far too high for us to try each one and brute-force the true minimum \(E\). This is why we use a heuristic algorithm, to get as close to the optimal solution as possible in a reasonable amount of time.
If we tried to visualise the energy landscape we can imagine it might look something like this:
The first thing you need to do is rescale each gene (row) of the data so the smallest value is given 0, and the highest value is given 1.
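A sketch of one way to do the rescaling. The helper name rescale01 is my own, and the toy matrix stands in for your expression data:

```r
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))  # map a vector onto [0, 1]

m <- matrix(c(2, 4, 6,
              10, 20, 30), nrow = 2, byrow = TRUE)  # toy stand-in for the expression matrix
m.scaled <- t(apply(m, 1, rescale01))  # apply() returns genes in columns, so transpose back
m.scaled
```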
Let's go through the process of the algorithm. To do this you need 4 parameters: the number of clusters \(K\), the number of iterations \(I\), a starting temperature \(Temp\), and a cooling rate \(cool\).

The algorithm goes:
Randomly assign each gene to one of your K clusters.
for each iteration \(I\) {
  Calculate \(E\). Call this \(E_{old}\).
  Randomly select a single gene and assign it randomly to another cluster.
  Calculate \(E_{new}\).
  if(\(E_{new} < E_{old}\)) { accept the move }
  else if(\(e^{-\frac{E_{new}-E_{old}}{Temp}} > R(0,1)\)) { accept the move }
  else { reject the move }
  Set a new \(Temp\) by doing \(Temp = Temp \times cool\).
}
Done
\(R(0,1)\) is a randomly generated number from a uniform distribution that lies between 0 and 1.
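The acceptance rule can be wrapped in a small helper, which might look something like this (accept.move is a hypothetical name, not part of any package):

```r
accept.move <- function(E.new, E.old, Temp){
  if(E.new < E.old) return(TRUE)           # downhill moves are always accepted
  exp(-(E.new - E.old) / Temp) > runif(1)  # uphill moves are accepted with a Temp-dependent probability
}
```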
Things to do
Tip
Try and break the problem up into functions which can be called when needed. It will be easier to write the main algorithm.
Why this exercise is a good one
By tackling a more difficult problem the idea is that you will dig deeper into the language and really understand how things work in the background of the functions you use in packages such as Seurat etc.
Are there libraries in R to do this?
Yes there are, but you won’t learn anything by using them which is why we are coding this “by hand”.
A little inspiration
Before you start, read this, and then watch this.
To make it tasty….
We’ll reconvene in X weeks and we’ll run each of your implementations on the same laptop and see who a) can get to the solution in the fastest time, and b) get the lowest \(E_{final}\).